Kenneth Tay
Oct 11, 2018
ggplot2
ggplot2 syntax

library(ggplot2)
ggplot()

ggplot() +
    geom_violin(data = mtcars,
                mapping = aes(x = factor(cyl), y = hp))

ggplot() +
    geom_violin(data = mtcars,
                mapping = aes(x = factor(cyl), y = hp)) +
    geom_point(data = mtcars,
               mapping = aes(x = factor(cyl), y = hp),
               position = "jitter")

ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_point(position = "jitter")

ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_point(position = "jitter") +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder",
         y = "Horsepower")

ggplot(data = mtcars,
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_point(position = "jitter") +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder",
         y = "Horsepower") +
    theme_classic()

dplyr (and %>% syntax)

We rarely get data in exactly the form we need!
Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).
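
Since the %>% operator appears in every dplyr example in this deck, a minimal sketch of what it does may help: x %>% f(y) is evaluated as f(x, y), so a pipeline reads left to right instead of inside out. The sketch uses the built-in mtcars data; the variable names nested and piped are illustrative only.

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- nrow(filter(mtcars, cyl == 4))

# Piped call: the left-hand side becomes the first argument of the next function
piped <- mtcars %>%
    filter(cyl == 4) %>%
    nrow()

# Both count the 4-cylinder cars in mtcars
identical(nested, piped)
## [1] TRUE
```
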
dplyr verbs

select(): pick variables by their names
mutate(): create new variables based on existing ones
arrange(): reorder rows
filter(): pick observations by their values
summarize(): collapse many values down to a single summary

scores
##     Name English Math Science History Spanish
## 1 Andrew      60   96      80      56      77
## 2   John      66   55      56      64      77
## 3   Mary      92   63      70      62      98
## 4   Jane      80   76      89      55      40
## 5    Bob      80   80      82      48      50
## 6    Dan      58   52      79      90      61
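
To follow along, the scores table printed above can be recreated as a data frame. This is a sketch, not necessarily how the original data was loaded. The Gender column is an assumption: the printed table does not show it, but the group_by example later in the deck uses one, and the values below are chosen only because they are consistent with the group means reported there. With this column present, printing scores will show one more column than the outputs in the deck.

```r
# Recreate the scores table printed above
scores <- data.frame(
    Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
    English = c(60, 66, 92, 80, 80, 58),
    Math    = c(96, 55, 63, 76, 80, 52),
    Science = c(80, 56, 70, 89, 82, 79),
    History = c(56, 64, 62, 55, 48, 90),
    Spanish = c(77, 77, 98, 40, 50, 61),
    # ASSUMPTION: Gender is not in the printed table; these values are inferred
    Gender  = c("M", "M", "F", "F", "M", "M"),
    stringsAsFactors = FALSE
)
```
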
select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

scores %>%
    select(Name, History)
##     Name History
## 1 Andrew      56
## 2   John      64
## 3   Mary      62
## 4   Jane      55
## 5    Bob      48
## 6    Dan      90

mutate: create new columns based on old ones

Form teacher: “What are their total scores?”
scores <- scores %>%
    mutate(Total = English + Math + Science + History + Spanish)
scores
##     Name English Math Science History Spanish Total
## 1 Andrew      60   96      80      56      77   369
## 2   John      66   55      56      64      77   318
## 3   Mary      92   63      70      62      98   385
## 4   Jane      80   76      89      55      40   340
## 5    Bob      80   80      82      48      50   340
## 6    Dan      58   52      79      90      61   340

arrange: reorder rows

Form teacher: “Can I have the students in order of overall performance?”
scores %>%
    arrange(Total)
##     Name English Math Science History Spanish Total
## 1   John      66   55      56      64      77   318
## 2   Jane      80   76      89      55      40   340
## 3    Bob      80   80      82      48      50   340
## 4    Dan      58   52      79      90      61   340
## 5 Andrew      60   96      80      56      77   369
## 6   Mary      92   63      70      62      98   385

Form teacher: “No no, better students on top please…”
scores %>%
    arrange(desc(Total))
##     Name English Math Science History Spanish Total
## 1   Mary      92   63      70      62      98   385
## 2 Andrew      60   96      80      56      77   369
## 3   Jane      80   76      89      55      40   340
## 4    Bob      80   80      82      48      50   340
## 5    Dan      58   52      79      90      61   340
## 6   John      66   55      56      64      77   318

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”
scores %>%
    arrange(desc(Total), Name)
##     Name English Math Science History Spanish Total
## 1   Mary      92   63      70      62      98   385
## 2 Andrew      60   96      80      56      77   369
## 3    Bob      80   80      82      48      50   340
## 4    Dan      58   52      79      90      61   340
## 5   Jane      80   76      89      55      40   340
## 6   John      66   55      56      64      77   318

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”
scores %>%
    filter(History < 60)
##     Name English Math Science History Spanish Total
## 1 Andrew      60   96      80      56      77   369
## 2   Jane      80   76      89      55      40   340
## 3    Bob      80   80      82      48      50   340

Other ways to make comparisons:
>: greater than
<: less than
>=: greater than or equal to
<=: less than or equal to
!=: not equal to
==: equal to (Do not use = to test for equality!!)

Combining comparisons:
!: not
&: and
|: or

filter examples

Dan’s parents: “I just want Dan’s scores”
scores %>%
    filter(Name == "Dan")
##   Name English Math Science History Spanish Total
## 1  Dan      58   52      79      90      61   340

Language teacher: “I want to know which students score < 50 for either English or Spanish”

scores %>%
    filter(English < 50 | Spanish < 50)
##   Name English Math Science History Spanish Total
## 1 Jane      80   76      89      55      40   340

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores”
scores %>%
    summarize(corr = cor(Math, Science))
##        corr
## 1 0.5470561

Science teacher: “I want to know the mean and standard deviation of the scores for science”
scores %>%
    summarize(Science_mean = mean(Science),
              Science_sd = sd(Science))
##   Science_mean Science_sd
## 1           76   11.54123

group_by: use dplyr verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”
(This assumes scores has a Gender column, which the table printed earlier does not show.)

scores %>%
    group_by(Gender) %>%
    summarize(Spanish_mean = mean(Spanish))
## # A tibble: 2 x 2
##   Gender Spanish_mean
##   <chr>         <dbl>
## 1 F              69  
## 2 M              66.2

dplyr commands

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”
scores %>%
    filter(English < 70 & Spanish < 70) %>%
    select(Name)
##   Name
## 1  Dan

Math teacher: “I want to know which students scored < 70 for math, and I just want their names and their mean score across subjects”
scores %>%
    filter(Math < 70) %>%
    mutate(Mean = (English + Math + Science + History + Spanish)/5) %>%
    select(Name, Mean)
##   Name Mean
## 1 John 63.6
## 2 Mary 77.0
## 3  Dan 68.0
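
The language-teacher pipeline above joins its two conditions with &. As a side note, filter() also accepts several conditions separated by commas and keeps only the rows that satisfy all of them, so the two forms below should select the same students. This sketch rebuilds just the columns it needs from the scores table; the names with_and and with_comma are illustrative only.

```r
library(dplyr)

# Only the columns needed here, copied from the scores table in this deck
scores <- data.frame(
    Name    = c("Andrew", "John", "Mary", "Jane", "Bob", "Dan"),
    English = c(60, 66, 92, 80, 80, 58),
    Spanish = c(77, 77, 98, 40, 50, 61)
)

# Conditions joined explicitly with &
with_and <- scores %>%
    filter(English < 70 & Spanish < 70) %>%
    select(Name)

# Conditions passed as separate arguments: filter() combines them with "and"
with_comma <- scores %>%
    filter(English < 70, Spanish < 70) %>%
    select(Name)

# Both contain only Dan
with_and$Name
with_comma$Name
```
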
Optional material

transmute: create new columns based on old ones, discard old ones

Form teacher: “I just want the mean score for each student”

scores %>%
    transmute(mean = (English + Math + Science + History + Spanish) / 5)

How does R understand the code filter(History < 60)?

For each row: is History less than 60 or not?
History < 60 is a statement that is either TRUE or FALSE
If TRUE, keep the row
filter(<condition>) only returns the rows for which <condition> is TRUE

TRUE or FALSE: boolean expression

3 > 2
## [1] TRUE
3 < 2
## [1] FALSE
3 == 2
## [1] FALSE
c(1, 2, 3, 1) == c(3, 2, 1, 2)
## [1] FALSE  TRUE FALSE FALSE
c(1, 2, 3, 1) == 1
## [1]  TRUE FALSE FALSE  TRUE

NAs!

1 == NA
## [1] NA
NA == NA
## [1] NA
is.na(NA)
## [1] TRUE
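
The NA results above matter for filter(): a comparison with a missing value gives NA rather than TRUE, and filter() keeps only the rows where the condition is TRUE, so rows with NA are dropped. The sketch below uses a small made-up data frame (df, with a missing History score for Bob, is hypothetical and not part of the scores data) to show this, and how is.na() can be used to keep such rows on purpose.

```r
library(dplyr)

# HYPOTHETICAL data: Bob's History score is missing
df <- data.frame(Name = c("Andrew", "Bob", "Dan"),
                 History = c(56, NA, 90))

# Bob's condition evaluates to NA and Dan's to FALSE, so only Andrew is kept
below_60 <- df %>%
    filter(History < 60)

# To keep rows with a missing score as well, ask for them explicitly
below_60_or_missing <- df %>%
    filter(History < 60 | is.na(History))

below_60$Name             # Andrew
below_60_or_missing$Name  # Andrew and Bob
```
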